Name: Cordum
Author: Cordum

The production problem

Retrying a failed LLM call is cheap. Undoing a partial payment, access change, or ticket update is not. Agent workflows break in the middle, and each completed step can leave side effects in another system.

Teams often discover this only after their first incident: the orchestration logic exists, but rollback policy, idempotency discipline, and evidence trails are missing.

What top results cover and miss

Source	Strong coverage	Missing piece
AWS Prompt chaining and saga patterns	Clear comparison of prompt chaining and saga orchestration in agentic workflows.	Limited guidance on policy-gated rollback for irreversible external actions.
AWS Saga orchestration patterns	Good overview of central orchestrators, retries, and multistep agent delegation.	Does not specify rollback evidence model for audits and incident forensics.
Azure Compensating Transaction pattern	Strong design principles for idempotent compensation and partial undo logic.	Not agent-specific, and light on pre-dispatch governance for autonomous actions.

Compensation model for agents

Layer	Required design	Failure if missing
Action contract	Every side-effecting action declares a compensation action or explicit no-rollback reason.	Silent partial failure with no recovery path.
Idempotency	Forward and compensation paths both carry stable idempotency keys.	Duplicate restores, double refunds, repeated API side effects.
Policy gate	High-risk actions without valid compensation are denied or require human approval.	Agent runs risky actions faster than operators can react.
Evidence timeline	Record who dispatched, what policy matched, and what compensation ran.	Post-incident analysis becomes guesswork.
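
The idempotency row is the easiest one to get subtly wrong, so here is a minimal sketch of it. It assumes a key can be derived from the workflow ID, the step name, and the direction (forward or compensation); that field choice, the `Ledger` type, and the `wf_123` and `charge-card` values are illustrative and not part of Cordum. The in-memory map stands in for a durable store and is not safe for concurrent use.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Direction distinguishes a forward action from its compensation.
type Direction string

const (
	Forward    Direction = "do"
	Compensate Direction = "compensate"
)

// IdempotencyKey derives a key from identifiers that stay the same across
// retries. The choice of fields is an assumption made for this sketch.
func IdempotencyKey(workflowID, step string, dir Direction) string {
	sum := sha256.Sum256([]byte(workflowID + "\x00" + step + "\x00" + string(dir)))
	return hex.EncodeToString(sum[:16])
}

// Ledger is an in-memory stand-in for a durable dedupe store.
type Ledger struct{ seen map[string]bool }

func NewLedger() *Ledger { return &Ledger{seen: make(map[string]bool)} }

// Once runs fn only the first time a key is presented and reports whether it
// ran. The key is recorded only after fn succeeds, so failures can be retried.
func (l *Ledger) Once(key string, fn func() error) (bool, error) {
	if l.seen[key] {
		return false, nil
	}
	if err := fn(); err != nil {
		return false, err
	}
	l.seen[key] = true
	return true, nil
}

func main() {
	ledger := NewLedger()
	refunds := 0
	key := IdempotencyKey("wf_123", "charge-card", Compensate)

	// Simulate the same compensation being delivered three times.
	for attempt := 0; attempt < 3; attempt++ {
		if _, err := ledger.Once(key, func() error { refunds++; return nil }); err != nil {
			fmt.Println("compensation failed:", err)
		}
	}
	fmt.Println("refunds issued:", refunds) // prints "refunds issued: 1"
}
```

Keeping the direction in the key means a forward action and its compensation never collide, while a replayed compensation is still deduplicated.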

Cordum runtime behavior

In Cordum, `FAILED_FATAL` on a workflow job triggers saga rollback with a compensation stack. This moves rollback from docs to runtime behavior.

Control	Current behavior	Why it matters
Dispatch retries	50 max scheduling retries, exponential backoff from 1s to 30s	Prevents infinite retry storms while still tolerating transient faults.
Rollback trigger	`FAILED_FATAL` on workflow job triggers saga rollback	Recovery starts from an explicit terminal failure signal, not implicit heuristics.
Rollback lock	Per-workflow lock `saga:<workflow_id>:lock` with 2-minute TTL	Avoids concurrent rollback races on the same workflow.
Rollback execution window	Rollback goroutine runs with 30-second timeout	Bounds blast radius when compensation path is degraded.

Implementation examples

Compensation-first execution loop (Go)

rollback.go

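package rollback

import (
  "context"
  "errors"
  "fmt"
)

// Step pairs a forward action with the compensation that undoes it.
type Step struct {
  Name        string
  Do          func(context.Context) error
  Compensate  func(context.Context) error // nil when the step declares no rollback
  Idempotency string                      // stable key shared across retries
}

// RunWithCompensation runs steps in order. On the first failure it compensates
// the applied steps in reverse (LIFO) order and returns every error joined.
func RunWithCompensation(ctx context.Context, steps []Step) error {
  applied := make([]Step, 0, len(steps))

  for _, step := range steps {
    if err := step.Do(ctx); err != nil {
      errs := []error{fmt.Errorf("step %q: %w", step.Name, err)}
      for i := len(applied) - 1; i >= 0; i-- {
        if applied[i].Compensate == nil {
          continue // nothing to undo for this step
        }
        if cerr := applied[i].Compensate(ctx); cerr != nil {
          errs = append(errs, fmt.Errorf("compensate %q: %w", applied[i].Name, cerr))
        }
      }
      return errors.Join(errs...) // requires Go 1.20+
    }
    applied = append(applied, step)
  }

  return nil
}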

Pre-dispatch rollback policy gate (YAML)

policy.yaml

version: v1
rules:
  - id: require-compensation-for-prod-writes
    when:
      env: production
      side_effect: true
      compensation_declared: false
    decision: require_human

  - id: deny-irreversible-delete-without-approval
    when:
      topic: infra.delete
      rollback_supported: false
      approval_present: false
    decision: deny
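
How rules like these get evaluated is not shown here, so the following is only a sketch of one plausible reading: rules are checked in order, a rule matches when every `when` condition equals the corresponding fact about the job, and the first match decides. The `Rule` type, the `Evaluate` function, first-match-wins ordering, and the default `allow` decision are all assumptions, not Cordum's documented semantics.

```go
package main

import "fmt"

// Rule mirrors one entry of the YAML policy: conditions plus a decision.
type Rule struct {
	ID       string
	When     map[string]any
	Decision string
}

// matches reports whether every condition equals the corresponding fact.
// A fact that is absent never matches. Values are expected to be strings
// or booleans, which are safe to compare with !=.
func matches(when, facts map[string]any) bool {
	for key, want := range when {
		got, ok := facts[key]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// Evaluate returns the decision and ID of the first matching rule.
// Falling back to "allow" is an assumption; a stricter gate could deny.
func Evaluate(rules []Rule, facts map[string]any) (decision, ruleID string) {
	for _, r := range rules {
		if matches(r.When, facts) {
			return r.Decision, r.ID
		}
	}
	return "allow", ""
}

func main() {
	rules := []Rule{
		{
			ID:       "require-compensation-for-prod-writes",
			When:     map[string]any{"env": "production", "side_effect": true, "compensation_declared": false},
			Decision: "require_human",
		},
		{
			ID:       "deny-irreversible-delete-without-approval",
			When:     map[string]any{"topic": "infra.delete", "rollback_supported": false, "approval_present": false},
			Decision: "deny",
		},
	}

	// A production write that declares no compensation.
	job := map[string]any{
		"env": "production", "topic": "billing.charge",
		"side_effect": true, "compensation_declared": false,
	}
	decision, id := Evaluate(rules, job)
	fmt.Println(decision, id) // prints "require_human require-compensation-for-prod-writes"
}
```

One consequence of this reading is worth checking against the real engine: a job that omits a fact such as `compensation_declared` matches no rule and is allowed, so missing metadata has to be rejected before evaluation or treated as a match.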

Rollback evidence record (JSON)

rollback-evidence.json

{
  "run_id": "run_9c2a",
  "job_id": "job_42",
  "status": "FAILED_FATAL",
  "matched_policy_id": "require-compensation-for-prod-writes",
  "rollback_triggered": true,
  "rollback_mode": "lifo_stack",
  "compensation_jobs_dispatched": 3,
  "compensation_jobs_skipped": 1,
  "rollback_finished_at": "2026-03-31T13:40:18Z"
}
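
A record like this is most useful when tooling can read it. The sketch below decodes the example into a Go struct and flags records that need an operator. The JSON field names come from the record above; the `RollbackEvidence` type and the `NeedsManualFollowUp` rule (any skipped compensation job) are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// RollbackEvidence maps the JSON evidence record onto typed fields.
type RollbackEvidence struct {
	RunID             string    `json:"run_id"`
	JobID             string    `json:"job_id"`
	Status            string    `json:"status"`
	MatchedPolicyID   string    `json:"matched_policy_id"`
	RollbackTriggered bool      `json:"rollback_triggered"`
	RollbackMode      string    `json:"rollback_mode"`
	Dispatched        int       `json:"compensation_jobs_dispatched"`
	Skipped           int       `json:"compensation_jobs_skipped"`
	FinishedAt        time.Time `json:"rollback_finished_at"`
}

// NeedsManualFollowUp reports whether a rollback ran but left compensation
// jobs skipped. Treating any skip as follow-up work is an assumption.
func (e RollbackEvidence) NeedsManualFollowUp() bool {
	return e.RollbackTriggered && e.Skipped > 0
}

// ParseEvidence decodes one evidence record.
func ParseEvidence(data []byte) (RollbackEvidence, error) {
	var e RollbackEvidence
	err := json.Unmarshal(data, &e)
	return e, err
}

const sample = `{
  "run_id": "run_9c2a",
  "job_id": "job_42",
  "status": "FAILED_FATAL",
  "matched_policy_id": "require-compensation-for-prod-writes",
  "rollback_triggered": true,
  "rollback_mode": "lifo_stack",
  "compensation_jobs_dispatched": 3,
  "compensation_jobs_skipped": 1,
  "rollback_finished_at": "2026-03-31T13:40:18Z"
}`

func main() {
	e, err := ParseEvidence([]byte(sample))
	if err != nil {
		fmt.Println("invalid evidence record:", err)
		return
	}
	fmt.Printf("%s on %s: dispatched=%d skipped=%d follow-up=%t\n",
		e.Status, e.JobID, e.Dispatched, e.Skipped, e.NeedsManualFollowUp())
}
```

Typing `rollback_finished_at` as `time.Time` means a malformed timestamp fails at decode time instead of surfacing later during incident review.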

Limitations and tradeoffs

- Some actions are not reversible. Compensation is then business mitigation, not true rollback.
- Strict approval gates reduce blast radius but can add operational latency during incidents.
- Compensation paths need their own testing budget, or they fail exactly when you need them.
- Safety checks during rollback are soft: a compensation step that is denied gets skipped, and someone has to follow up manually.

Next step

Run this 4-week rollout:

1. Tag all side-effecting agent topics as reversible, compensatable, or irreversible.
2. Block production dispatch when compensation metadata is missing.
3. Rehearse one `FAILED_FATAL` drill each week and verify full timeline evidence.
4. Add a manual incident branch for compensation-denied and compensation-timeout cases.
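
The first two steps can start as a small audit over a topic registry. This is a sketch under assumptions: the `TopicMeta` fields, the reading that reversible and compensatable topics must name an undo topic while irreversible ones must state a reason, and every topic name except `infra.delete` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Reversibility is the tag applied to each side-effecting topic.
type Reversibility string

const (
	Reversible    Reversibility = "reversible"
	Compensatable Reversibility = "compensatable"
	Irreversible  Reversibility = "irreversible"
)

// TopicMeta holds the rollback metadata for one topic (hypothetical shape).
type TopicMeta struct {
	Class            Reversibility
	CompensationFor  string // topic that undoes or offsets this one
	NoRollbackReason string // required when Class is Irreversible
}

// AuditTopics returns, in sorted order, the topics whose metadata is
// incomplete and which should therefore be blocked from production dispatch.
func AuditTopics(topics map[string]TopicMeta) []string {
	var blocked []string
	for name, meta := range topics {
		switch meta.Class {
		case Reversible, Compensatable:
			if meta.CompensationFor == "" {
				blocked = append(blocked, name)
			}
		case Irreversible:
			if meta.NoRollbackReason == "" {
				blocked = append(blocked, name)
			}
		default: // untagged or unknown class
			blocked = append(blocked, name)
		}
	}
	sort.Strings(blocked)
	return blocked
}

func main() {
	topics := map[string]TopicMeta{
		"billing.charge": {Class: Compensatable, CompensationFor: "billing.refund"},
		"ticket.update":  {Class: Reversible}, // missing undo topic
		"infra.delete":   {Class: Irreversible, NoRollbackReason: "data is destroyed"},
		"access.grant":   {}, // never tagged
	}
	fmt.Println("blocked:", AuditTopics(topics)) // prints "blocked: [access.grant ticket.update]"
}
```

Running an audit like this in CI keeps the registry honest before the dispatch-time block in step 2 is switched on.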

Continue with Approval Workflows for Autonomous AI Agents and AI Agent Incident Report.

AI Agent Rollback and Compensation in Production