The production problem
Retrying a failed LLM call is cheap. Undoing a partial payment, access change, or ticket update is not. Agent workflows break in the middle, and each completed step can leave side effects in another system.
Teams often discover this only after their first incident: orchestration logic exists, but rollback policy, idempotency discipline, and evidence trails are missing.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Prompt chaining and saga patterns | Clear comparison of prompt chaining and saga orchestration in agentic workflows. | Limited guidance on policy-gated rollback for irreversible external actions. |
| AWS Saga orchestration patterns | Good overview of central orchestrators, retries, and multistep agent delegation. | Does not specify a rollback evidence model for audits and incident forensics. |
| Azure Compensating Transaction pattern | Strong design principles for idempotent compensation and partial undo logic. | Not agent-specific, and light on pre-dispatch governance for autonomous actions. |
Compensation model for agents
| Layer | Required design | Failure if missing |
|---|---|---|
| Action contract | Every side-effecting action declares a compensation action or explicit no-rollback reason. | Silent partial failure with no recovery path. |
| Idempotency | Forward and compensation paths both carry stable idempotency keys. | Duplicate restores, double refunds, repeated API side effects. |
| Policy gate | High-risk actions without valid compensation are denied or require human approval. | Agent runs risky actions faster than operators can react. |
| Evidence timeline | Record who dispatched, what policy matched, and what compensation ran. | Post-incident analysis becomes guesswork. |
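The idempotency row can be sketched as a small ledger that completes an action at most once per key. This is a minimal in-memory illustration, not Cordum's implementation: the `Ledger` type, the `Once` method, and the `undo:charge_1` key format are hypothetical, and a real system would keep the keys in a durable store shared by all workers.

```go
package main

import (
	"errors"
	"sync"
)

// ErrEmptyKey is returned when an action carries no idempotency key.
var ErrEmptyKey = errors.New("idempotency key is required")

// Ledger remembers which idempotency keys have already completed.
// Hypothetical helper: in-memory only, for illustration.
type Ledger struct {
	mu   sync.Mutex
	done map[string]bool
}

// NewLedger returns an empty ledger.
func NewLedger() *Ledger {
	return &Ledger{done: make(map[string]bool)}
}

// Once runs fn only if key has not completed before, and reports whether
// fn completed during this call. A failed fn is not recorded, so a retry
// with the same key runs it again. The lock is held while fn runs, which
// serializes all actions; that keeps the sketch short but is not tuned.
func (l *Ledger) Once(key string, fn func() error) (bool, error) {
	if key == "" {
		return false, ErrEmptyKey
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.done[key] {
		return false, nil // duplicate delivery: skip the side effect
	}
	if err := fn(); err != nil {
		return false, err
	}
	l.done[key] = true
	return true, nil
}
```

Using separate keys for the forward and compensation paths (for example `do:charge_1` and `undo:charge_1`) lets a retried rollback skip refunds that already went through without blocking the original charge.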
Cordum runtime behavior
In Cordum, `FAILED_FATAL` on a workflow job triggers saga rollback with a compensation stack. This moves rollback from docs to runtime behavior.
| Control | Current behavior | Why it matters |
|---|---|---|
| Dispatch retries | 50 max scheduling retries, exponential backoff from 1s to 30s | Prevents infinite retry storms while still tolerating transient faults. |
| Rollback trigger | `FAILED_FATAL` on workflow job triggers saga rollback | Recovery starts from an explicit terminal failure signal, not implicit heuristics. |
| Rollback lock | Per-workflow lock `saga:<workflow_id>:lock` with 2-minute TTL | Avoids concurrent rollback races on the same workflow. |
| Rollback execution window | Rollback goroutine runs with 30-second timeout | Bounds blast radius when compensation path is degraded. |
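To make the dispatch retry row concrete, here is a rough sketch of a capped exponential backoff. Only the bounds (50 retries, 1s to 30s) come from the table; the doubling factor, the absence of jitter, and the names `Backoff` and `ShouldRetry` are assumptions for illustration and may not match what Cordum does internally.

```go
package main

import "time"

// Bounds taken from the table above. The growth factor (2x) is an assumption.
const (
	maxRetries = 50
	baseDelay  = 1 * time.Second
	maxDelay   = 30 * time.Second
)

// Backoff returns the delay before retry number attempt (0-based).
// The loop stops doubling once the cap is reached, so large attempt
// numbers cannot overflow time.Duration.
func Backoff(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// ShouldRetry reports whether another scheduling attempt is allowed.
func ShouldRetry(attempt int) bool {
	return attempt < maxRetries
}
```

Under these assumptions the delays run 1s, 2s, 4s, 8s, 16s and then stay at 30s, so most of the 50 attempts sit at the cap.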
Implementation examples
Compensation-first execution loop (Go)
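```go
import "context"

// Step pairs a forward action with the compensation that undoes it.
type Step struct {
	Name        string
	Do          func(context.Context) error
	Compensate  func(context.Context) error // nil when no rollback is declared
	Idempotency string                      // stable key reused across retries
}

// RunWithCompensation runs steps in order. On the first failure it
// compensates the already-applied steps in reverse (LIFO) order.
func RunWithCompensation(ctx context.Context, steps []Step) error {
	applied := make([]Step, 0, len(steps))
	for _, step := range steps {
		if err := step.Do(ctx); err != nil {
			for i := len(applied) - 1; i >= 0; i-- {
				if applied[i].Compensate == nil {
					continue
				}
				_ = applied[i].Compensate(ctx) // collect and report in production
			}
			return err
		}
		applied = append(applied, step)
	}
	return nil
}
```

Pre-dispatch rollback policy gate (YAML)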
```yaml
version: v1
rules:
  - id: require-compensation-for-prod-writes
    when:
      env: production
      side_effect: true
      compensation_declared: false
    decision: require_human
  - id: deny-irreversible-delete-without-approval
    when:
      topic: infra.delete
      rollback_supported: false
      approval_present: false
    decision: deny
```

Rollback evidence record (JSON)
```json
{
  "run_id": "run_9c2a",
  "job_id": "job_42",
  "status": "FAILED_FATAL",
  "matched_policy_id": "require-compensation-for-prod-writes",
  "rollback_triggered": true,
  "rollback_mode": "lifo_stack",
  "compensation_jobs_dispatched": 3,
  "compensation_jobs_skipped": 1,
  "rollback_finished_at": "2026-03-31T13:40:18Z"
}
```

Limitations and tradeoffs
- Some actions are not reversible. Compensation is then business mitigation, not true rollback.
- Strict approval gates reduce blast radius but can add operational latency during incidents.
- Compensation paths need their own testing budget, or they fail exactly when you need them.
- Soft safety checks during rollback can skip denied compensation steps, which requires manual follow-up.
Next step
Run this 4-week rollout:
1. Tag all side-effecting agent topics as reversible, compensatable, or irreversible.
2. Block production dispatch when compensation metadata is missing.
3. Rehearse one `FAILED_FATAL` drill each week and verify full timeline evidence.
4. Add a manual incident branch for compensation-denied and compensation-timeout cases.
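Steps 1 and 2 could start as a small in-process check before any policy engine is wired in. The sketch below is one possible shape, assuming a hypothetical `Topic` metadata struct and a `Gate` function. The `require_human` and `deny` decisions come from the policy example, while `allow` and the mapping of each tag to a decision are assumptions.

```go
package main

import "fmt"

// Reversibility is the tag from step 1. The zero value means "not tagged yet".
type Reversibility int

const (
	Untagged Reversibility = iota
	Reversible
	Compensatable
	Irreversible
)

// Decision is the gate result. "allow" is an assumed third value.
type Decision string

const (
	Allow        Decision = "allow"
	RequireHuman Decision = "require_human"
	Deny         Decision = "deny"
)

// Topic describes one side-effecting agent topic (hypothetical metadata shape).
type Topic struct {
	Name         string
	Tag          Reversibility
	Compensation string // name of the compensating topic, empty if none
}

// Gate decides whether a dispatch may proceed and returns a reason
// suitable for the evidence timeline.
func Gate(env string, t Topic, approved bool) (Decision, string) {
	if env != "production" {
		return Allow, "non-production environment"
	}
	switch {
	case t.Tag == Untagged:
		return Deny, fmt.Sprintf("%s: missing reversibility tag", t.Name)
	case t.Tag == Compensatable && t.Compensation == "":
		return Deny, fmt.Sprintf("%s: compensation not declared", t.Name)
	case t.Tag == Irreversible && !approved:
		return RequireHuman, fmt.Sprintf("%s: irreversible action needs approval", t.Name)
	}
	return Allow, "ok"
}
```

For example, `Gate("production", Topic{Name: "infra.delete", Tag: Irreversible}, false)` returns `require_human`, while an untagged topic is denied outright until someone classifies it.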
Continue with Approval Workflows for Autonomous AI Agents and AI Agent Incident Report.