
AI Agent Blameless Postmortem Template

Capture system causes and actionable fixes after incidents without reducing analysis to blame.

Guide · 11 min read · Mar 2026
TL;DR
  - Blameless does not mean low accountability. It means system-focused accountability.
  - A postmortem without corrective owners and due dates is documentation theater.
  - Autonomous-agent postmortems need policy and replay evidence, not just uptime graphs.
  - Ask "what" and "how" questions before asking "who" or "why".
System focus

Capture environmental conditions and decision context, not personal blame narratives.

Action quality

Track corrective actions with owner, priority, and verification signal.

Operational depth

Include lock state, policy path, and replay evidence for distributed agent workflows.

Scope

This template is for autonomous AI agent incidents where queueing, retries, policy checks, and distributed coordination all influence the failure path.

The production problem

Teams often complete incident reviews quickly and still repeat the same class of outage. The missing piece is usually not effort. It is evidence quality.

Generic postmortem templates capture timeline and impact, but autonomous-agent systems need extra detail: policy path, replay integrity, and lock behavior under load.

If these fields are missing, corrective actions become broad refactors instead of targeted reliability fixes.

What top results miss

| Source | Strong coverage | Missing piece |
|--------|-----------------|---------------|
| Google SRE: Postmortem Culture | Excellent blameless philosophy and criteria for when postmortems should be written. | No concrete template fields for policy-gated autonomous execution paths. |
| Atlassian: Incident Postmortem Template | Strong section-by-section template with Five Whys and corrective actions. | No distributed-lock or replay-integrity checks for queue-driven AI control planes. |
| PagerDuty: The Blameless Postmortem | Great guidance on blame-aware language and cognitive-bias countermeasures. | No metric-backed evidence checklist for modern agent orchestration stacks. |

Template structure

A useful postmortem template reduces ambiguity during writing and increases cross-incident comparability. Keep sections stable across incidents.

| Section | Must include | Weak pattern to avoid |
|---------|--------------|-----------------------|
| Incident summary | Customer impact, duration, severity, affected workflows | Only technical symptoms and no user/business impact |
| Timeline | Detection, response, recovery, and decision timestamps | Retroactive timeline built from memory only |
| Contributing factors | Load conditions, config drift, dependency health, governance mode | Single-cause storyline for complex failure chains |
| Policy path analysis | Denied, deferred, quarantined, and fail-mode decisions | Treating all blocked jobs as generic platform failure |
| Corrective actions | Owner, due date, priority, and objective verification metric | Action list with no ownership or closure signal |
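Keeping sections stable across incidents is easy to enforce mechanically. A minimal sketch that checks a draft against the section headings used in the template later in this guide; the plain substring match is a simplifying assumption, not a real Markdown parser:

```python
# Section names mirror the postmortem template in this guide.
REQUIRED_SECTIONS = [
    "Incident Summary",
    "Detection and Timeline",
    "Technical Impact Evidence",
    "Policy Path Analysis",
    "Contributing Factors",
    "Replay and Consistency Checks",
    "Corrective Actions",
    "Lessons Learned",
]

def missing_sections(draft: str) -> list[str]:
    """Return required section headings absent from a postmortem draft."""
    return [s for s in REQUIRED_SECTIONS if s not in draft]
```

Running this in CI against new postmortem files keeps cross-incident comparability from eroding over time.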

Cordum evidence map

| Evidence | Current behavior | Why it matters |
|----------|------------------|----------------|
| User-impact severity signals | Dispatch p99 > 1s and failed ratio > 10% are existing production alert thresholds | Severity maps directly onto known operational alert boundaries. |
| Policy dependency evidence | `cordum_safety_unavailable_total` plus safety-kernel health checks | Shows whether the incident was a dependency outage versus worker/executor pressure. |
| Output governance evidence | `cordum_output_policy_quarantined_total` and job `failure_reason` samples | Distinguishes valid policy intervention from false-positive scanner behavior. |
| Replay integrity evidence | `cordum_scheduler_orphan_replayed_total` trend and stale-jobs gauge | Verifies backlog recovery and highlights latent incident debt after service restore. |
| Distributed lock evidence | Redis lock keys such as `cordum:reconciler:default` and per-job lock keys | Confirms single-writer assumptions during and after multi-replica incidents. |
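The first row of the evidence map can be turned into a direct severity mapping. A sketch of that mapping; only the p99 > 1s and failed-ratio > 10% thresholds come from the table above, while the SEV labels and the AND/OR tiering are illustrative assumptions:

```python
def classify_severity(dispatch_p99_seconds: float, failed_ratio: float) -> str:
    """Map incident metrics onto severity tiers using the production
    alert boundaries (dispatch p99 > 1s, failed completion ratio > 10%).
    The SEV tiering itself is an assumed convention, not Cordum-defined."""
    p99_breached = dispatch_p99_seconds > 1.0
    ratio_breached = failed_ratio > 0.10
    if p99_breached and ratio_breached:
        return "SEV-1"
    if p99_breached or ratio_breached:
        return "SEV-2"
    return "SEV-3"
```

Anchoring severity to the same thresholds that page on-call avoids debates about impact during the review itself.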

Implementation examples

Blameless postmortem template (Markdown)

postmortem-template.md
```markdown
# Incident Postmortem - Blameless Template

## 1) Incident Summary
- Incident ID:
- Severity:
- Start time / End time:
- Customer impact summary:
- Affected workflows/topics:

## 2) Detection and Timeline
- Detection source:
- First alert timestamp:
- Response milestones:
- Recovery timestamp:

## 3) Technical Impact Evidence
- Dispatch p99 during incident:
- Failed completion ratio:
- Stale jobs peak:
- Quarantine rate:

## 4) Policy Path Analysis
- safety_unavailable events:
- denied/quarantined event counts:
- POLICY_CHECK_FAIL_MODE at incident time:
- Any temporary policy overrides:

## 5) Contributing Factors (What/How framing)
- What conditions existed?
- How did they interact?
- What signals were missing or noisy?

## 6) Replay and Consistency Checks
- Orphan replay trend:
- Reconciler lock evidence:
- Duplicate side-effect check result:

## 7) Corrective Actions
| Action | Owner | Priority | Due Date | Verification Metric |
|--------|-------|----------|----------|---------------------|
|        |       |          |          |                     |

## 8) Lessons Learned
- What went well:
- What slowed response:
- What to change in runbooks/alerts:
```

Evidence query pack (PromQL)

postmortem-evidence.promql
```promql
# Dispatch p99
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))

# Failed completion ratio
rate(cordum_jobs_completed_total{status="failed"}[5m])
/ clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)

# Safety dependency degradation
rate(cordum_safety_unavailable_total[5m])

# Output quarantine trend
rate(cordum_output_policy_quarantined_total[5m])
```
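The `clamp_min` guard in the failed-ratio query deserves a note: it keeps the denominator at or above a floor so the ratio stays finite when job throughput drops to zero. The same guard in plain Python (a sketch; the rate arguments are assumed to be pre-computed per-second rates):

```python
def failed_completion_ratio(failed_rate: float, total_rate: float,
                            floor: float = 0.001) -> float:
    """Divide with the denominator clamped to `floor`, mirroring the
    PromQL clamp_min: a quiet system reports ~0 instead of NaN/Inf."""
    return failed_rate / max(total_rate, floor)
```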

Corrective action schema (YAML)

postmortem-actions.yaml
```yaml
postmortem_actions:
  - id: PM-001
    title: Add scheduler lock-state alert
    owner: platform-oncall
    priority: p1
    due_date: 2026-04-14
    verification:
      metric: cordum_scheduler_stale_jobs
      target: "<= 10 peak during weekly load test"
      evidence_link: "https://internal/wiki/load-test-2026-04-14"
  - id: PM-002
    title: Harden safety-kernel TLS health probes
    owner: security-platform
    priority: p1
    due_date: 2026-04-10
    verification:
      metric: cordum_safety_unavailable_total
      target: "0 sustained spikes in 14 days"
      evidence_link: "https://internal/wiki/safety-kernel-probe-rollout"
```
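Schema enforcement is what keeps the "no ownership or closure signal" weak pattern out. A minimal sketch that assumes the YAML above has already been parsed into a list of dicts (e.g. with PyYAML, not shown):

```python
# Required fields per corrective action, per the schema above.
REQUIRED_FIELDS = {"id", "title", "owner", "priority", "due_date", "verification"}

def incomplete_actions(actions: list[dict]) -> list[str]:
    """Return IDs of corrective actions that are missing a required
    field or have no verification metric."""
    flagged = []
    for action in actions:
        missing = REQUIRED_FIELDS - action.keys()
        no_metric = not action.get("verification", {}).get("metric")
        if missing or no_metric:
            flagged.append(action.get("id", "<no-id>"))
    return flagged
```

Wiring this check into postmortem review makes incomplete actions a merge blocker rather than a retrospective complaint.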

Limitations and tradeoffs

  - Rich templates increase writing time; poor tooling makes this friction obvious.
  - Overly strict template enforcement can suppress incident-specific nuance.
  - Blameless language can be misused to avoid direct ownership if action tracking is weak.
  - Metrics alone are insufficient; responder decision context still matters for learning.

Next step

Run this in one sprint:

  1. Replace your current template with the structure above for all SEV-1 and SEV-2 incidents.
  2. Require one policy-path evidence item and one replay-integrity evidence item per postmortem.
  3. Enforce owner + due date + verification metric on every corrective action.
  4. Review action closure rate after 30 days and adjust template fields where completion lags.
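Step 4 can be computed from the same corrective-action records. A sketch; `due_date` mirrors the YAML schema earlier in this guide, while `status` is an assumed tracking field (values like "open"/"closed") that the schema does not define:

```python
from datetime import date

def closure_rate(actions: list[dict], as_of: date) -> float:
    """Fraction of corrective actions due on or before `as_of` that
    are closed. `status` is an assumed closure-tracking field."""
    due = [a for a in actions if a["due_date"] <= as_of]
    if not due:
        return 1.0  # nothing due yet counts as fully on schedule
    closed = sum(1 for a in due if a.get("status") == "closed")
    return closed / len(due)
```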

Continue with AI Agent Incident Response Runbook and AI Agent SLOs and Error Budgets.

The template is only half the work

Real improvement starts when action items are verified in production and closed on schedule.