Guide

AI Agent Rollback and Compensation in Production

If an agent can change systems, it needs a tested undo path before execution.

Guide · 11 min read · Mar 2026
TL;DR
  • Rollback for AI agents is not a database undo. It is a second workflow with its own failure modes.
  • If a high-risk action has no compensation contract, block it before dispatch.
  • Recovery quality depends on idempotency keys, ordering, and timeline evidence, not hope.
Compensation-first

Define how to undo before allowing side effects

Policy gate

Require approval when rollback quality is low

Failure budget

Use explicit timeouts and bounded retries

Scope

This guide focuses on autonomous agents that execute real actions across infrastructure, internal APIs, and external tools where failures can leave lasting side effects.

The production problem

Retrying a failed LLM call is cheap. Undoing a partial payment, access change, or ticket update is not. Agent workflows break in the middle, and each completed step can leave side effects in another system.

Teams often discover this only after their first incident: the orchestration logic exists, but rollback policy, idempotency discipline, and evidence trails are missing.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS Prompt chaining and saga patterns | Clear comparison of prompt chaining and saga orchestration in agentic workflows. | Limited guidance on policy-gated rollback for irreversible external actions. |
| AWS Saga orchestration patterns | Good overview of central orchestrators, retries, and multistep agent delegation. | Does not specify a rollback evidence model for audits and incident forensics. |
| Azure Compensating Transaction pattern | Strong design principles for idempotent compensation and partial undo logic. | Not agent-specific, and light on pre-dispatch governance for autonomous actions. |

Compensation model for agents

| Layer | Required design | Failure if missing |
| --- | --- | --- |
| Action contract | Every side-effecting action declares a compensation action or an explicit no-rollback reason. | Silent partial failure with no recovery path. |
| Idempotency | Forward and compensation paths both carry stable idempotency keys. | Duplicate restores, double refunds, repeated API side effects. |
| Policy gate | High-risk actions without valid compensation are denied or require human approval. | Agent runs risky actions faster than operators can react. |
| Evidence timeline | Record who dispatched, what policy matched, and what compensation ran. | Post-incident analysis becomes guesswork. |

Cordum runtime behavior

In Cordum, `FAILED_FATAL` on a workflow job triggers saga rollback with a compensation stack. This moves rollback from docs to runtime behavior.

| Control | Current behavior | Why it matters |
| --- | --- | --- |
| Dispatch retries | 50 max scheduling retries, exponential backoff from 1s to 30s | Prevents infinite retry storms while still tolerating transient faults. |
| Rollback trigger | `FAILED_FATAL` on workflow job triggers saga rollback | Recovery starts from an explicit terminal failure signal, not implicit heuristics. |
| Rollback lock | Per-workflow lock `saga:<workflow_id>:lock` with 2-minute TTL | Avoids concurrent rollback races on the same workflow. |
| Rollback execution window | Rollback goroutine runs with 30-second timeout | Bounds blast radius when compensation path is degraded. |

Implementation examples

Compensation-first execution loop (Go)

rollback.go
Go
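package rollback

import (
  "context"
  "errors"
  "fmt"
  "time"
)

// Step is one side-effecting action paired with its undo.
type Step struct {
  Name        string
  Do          func(context.Context) error
  Compensate  func(context.Context) error // nil means an explicit no-rollback step
  Idempotency string                      // stable key shared by Do and Compensate
}

// RunWithCompensation runs steps in order. On the first failure it compensates
// applied steps in reverse (LIFO) order and returns all errors joined together.
func RunWithCompensation(ctx context.Context, steps []Step) error {
  applied := make([]Step, 0, len(steps))

  for _, step := range steps {
    if err := step.Do(ctx); err != nil {
      // Detach from ctx so a cancelled forward path cannot cancel the undo (Go 1.21+).
      rbCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), 30*time.Second)
      defer cancel()
      errs := []error{fmt.Errorf("step %q: %w", step.Name, err)}
      for i := len(applied) - 1; i >= 0; i-- {
        if applied[i].Compensate == nil {
          continue
        }
        if cerr := applied[i].Compensate(rbCtx); cerr != nil {
          errs = append(errs, fmt.Errorf("compensate %q: %w", applied[i].Name, cerr))
        }
      }
      return errors.Join(errs...)
    }
    applied = append(applied, step)
  }

  return nil
}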

Pre-dispatch rollback policy gate (YAML)

policy.yaml
YAML
version: v1
rules:
  - id: require-compensation-for-prod-writes
    when:
      env: production
      side_effect: true
      compensation_declared: false
    decision: require_human

  - id: deny-irreversible-delete-without-approval
    when:
      topic: infra.delete
      rollback_supported: false
      approval_present: false
    decision: deny
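To show how these two rules behave at dispatch time, here is a small Go sketch that hard-codes them. It is not a policy engine and not Cordum's evaluator: the `Request` and `Decision` types are hypothetical, and two behaviors are assumptions the YAML does not state, namely that `deny` takes precedence over `require_human` and that an unmatched request is allowed.

```go
package main

import "fmt"

type Decision string

const (
	Allow        Decision = "allow" // assumed default when no rule matches
	RequireHuman Decision = "require_human"
	Deny         Decision = "deny"
)

// Request holds the attributes the two rules above match on.
type Request struct {
	Env                  string
	Topic                string
	SideEffect           bool
	CompensationDeclared bool
	RollbackSupported    bool
	ApprovalPresent      bool
}

// Evaluate returns the decision and the id of the rule that matched.
// The deny rule is checked first so the most restrictive outcome wins (assumed precedence).
func Evaluate(r Request) (Decision, string) {
	if r.Topic == "infra.delete" && !r.RollbackSupported && !r.ApprovalPresent {
		return Deny, "deny-irreversible-delete-without-approval"
	}
	if r.Env == "production" && r.SideEffect && !r.CompensationDeclared {
		return RequireHuman, "require-compensation-for-prod-writes"
	}
	return Allow, ""
}

func main() {
	d, rule := Evaluate(Request{Env: "production", Topic: "billing.refund", SideEffect: true})
	fmt.Println(d, rule)
}
```

Returning the matched rule id alongside the decision gives the evidence timeline something to record.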

Rollback evidence record (JSON)

rollback-evidence.json
JSON
{
  "run_id": "run_9c2a",
  "job_id": "job_42",
  "status": "FAILED_FATAL",
  "matched_policy_id": "require-compensation-for-prod-writes",
  "rollback_triggered": true,
  "rollback_mode": "lifo_stack",
  "compensation_jobs_dispatched": 3,
  "compensation_jobs_skipped": 1,
  "rollback_finished_at": "2026-03-31T13:40:18Z"
}
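An evidence record is only useful if something reads it. As a sketch, the record above can be decoded into a Go struct and checked for skipped compensation steps, which need manual follow-up. The struct, `ParseEvidence`, and `NeedsManualFollowUp` are hypothetical names; only the JSON field names come from the example record.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// RollbackEvidence mirrors the JSON record shown above.
type RollbackEvidence struct {
	RunID             string    `json:"run_id"`
	JobID             string    `json:"job_id"`
	Status            string    `json:"status"`
	MatchedPolicyID   string    `json:"matched_policy_id"`
	RollbackTriggered bool      `json:"rollback_triggered"`
	RollbackMode      string    `json:"rollback_mode"`
	Dispatched        int       `json:"compensation_jobs_dispatched"`
	Skipped           int       `json:"compensation_jobs_skipped"`
	FinishedAt        time.Time `json:"rollback_finished_at"` // RFC 3339 timestamp
}

// ParseEvidence decodes a record and rejects one that cannot be tied to a run and job.
func ParseEvidence(raw []byte) (RollbackEvidence, error) {
	var e RollbackEvidence
	if err := json.Unmarshal(raw, &e); err != nil {
		return e, fmt.Errorf("decode evidence: %w", err)
	}
	if e.RunID == "" || e.JobID == "" {
		return e, fmt.Errorf("evidence record is missing run_id or job_id")
	}
	return e, nil
}

// NeedsManualFollowUp reports whether a rollback ran but skipped some compensation.
func (e RollbackEvidence) NeedsManualFollowUp() bool {
	return e.RollbackTriggered && e.Skipped > 0
}

func main() {
	raw := []byte(`{"run_id":"run_9c2a","job_id":"job_42","status":"FAILED_FATAL",
		"rollback_triggered":true,"compensation_jobs_dispatched":3,
		"compensation_jobs_skipped":1,"rollback_finished_at":"2026-03-31T13:40:18Z"}`)
	e, err := ParseEvidence(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(e.RunID, e.Dispatched, e.Skipped, e.NeedsManualFollowUp())
}
```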

Limitations and tradeoffs

  • Some actions are not reversible. Compensation is then business mitigation, not true rollback.
  • Strict approval gates reduce blast radius but can add operational latency during incidents.
  • Compensation paths need their own testing budget, or they fail exactly when you need them.
  • Soft safety checks during rollback can skip denied compensation steps, which requires manual follow-up.

Next step

Run this 4-week rollout:

  1. Tag all side-effecting agent topics as reversible, compensatable, or irreversible.
  2. Block production dispatch when compensation metadata is missing.
  3. Rehearse one `FAILED_FATAL` drill each week and verify full timeline evidence.
  4. Add a manual incident branch for compensation-denied and compensation-timeout cases.
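The first two steps of the rollout can start as a simple audit. The sketch below assumes a hypothetical in-code tag registry: `Reversibility`, `UntaggedTopics`, `CanDispatch`, and the topic names are illustrative and not part of any real API. It lists topics that still lack a tag and blocks production dispatch for them.

```go
package main

import (
	"fmt"
	"sort"
)

// Reversibility is the tag assigned to each side-effecting topic.
type Reversibility int

const (
	Untagged Reversibility = iota // zero value: no metadata recorded yet
	Reversible
	Compensatable
	Irreversible
)

// UntaggedTopics returns, in sorted order, the topics that still have no tag.
func UntaggedTopics(tags map[string]Reversibility, topics []string) []string {
	var missing []string
	for _, t := range topics {
		if tags[t] == Untagged {
			missing = append(missing, t)
		}
	}
	sort.Strings(missing)
	return missing
}

// CanDispatch blocks production dispatch for topics without metadata.
// A tagged irreversible topic passes this check; deciding whether it needs
// approval is left to the policy gate.
func CanDispatch(env, topic string, tags map[string]Reversibility) bool {
	if env != "production" {
		return true
	}
	return tags[topic] != Untagged
}

func main() {
	tags := map[string]Reversibility{
		"billing.refund": Compensatable,
		"infra.delete":   Irreversible,
	}
	topics := []string{"infra.delete", "ticket.update", "billing.refund"}
	fmt.Println("untagged:", UntaggedTopics(tags, topics))
	fmt.Println("dispatch ticket.update in production:", CanDispatch("production", "ticket.update", tags))
}
```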

Continue with Approval Workflows for Autonomous AI Agents and AI Agent Incident Report.

Rollback is a product feature

If rollback works only in architecture diagrams, production will eventually price that mistake for you.