Guide

AI Agent Rollback & Compensation Patterns (2026)

If an agent can change systems, it needs a tested undo path before execution.

11 min read · Updated June 2026

You do not roll back an AI agent action by undoing it — you run a second, compensating workflow. Use the saga pattern: every side-effecting step declares a compensation up front, and when a later step fails fatally the orchestrator replays those compensations in reverse (LIFO) order. In Cordum, a successful step pushes its compensation onto a per-workflow Redis stack, and a FAILED_FATAL result triggers the saga manager to pop and dispatch each compensation as a CRITICAL-priority job — gated by the Safety Kernel and recorded in the audit trail.

TL;DR
  • Rollback for agent systems is a second workflow, not a database undo button.
  • Cordum triggers saga rollback on `FAILED_FATAL`, with LIFO compensation and a per-workflow lock.
  • Compensation quality depends on idempotency, safety behavior, and evidence you can query after the incident.
Compensation-first

Define how to undo before allowing side effects

Policy gate

Require approval or deny when rollback quality is weak

Bounded recovery

Use explicit lock and timeout windows for rollback execution

Scope

This guide targets autonomous workflows that touch external systems where partial success leaves real side effects.

The production problem

Partial failure is the default shape of distributed agent workflows. Payment may succeed while inventory rollback fails. Access may be granted while notification delivery times out.

Teams often discover this too late: forward workflow is automated, but compensation is manual, slow, and undocumented.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS Saga patterns | Clear continuation vs compensation framing and choreography/orchestration split. | No concrete governance model for high-risk autonomous actions before dispatch. |
| AWS serverless saga with Step Functions | Solid end-to-end orchestration example with explicit compensation branches. | Limited guidance on compensation safety-deny semantics and operator evidence trails. |
| Azure Compensating Transaction pattern | Strong treatment of idempotency, non-reverse ordering, and manual intervention paths. | Not agent-control-plane specific; no pre-dispatch policy gate for autonomous workflows. |

Compensation model for agents

| Layer | Required design | Failure if missing |
| --- | --- | --- |
| Action contract | Every side-effecting action declares compensation or explicit no-compensation reason. | Partial success with no deterministic recovery path. |
| Idempotency | Compensation requests carry stable keys and can be retried safely. | Duplicate refunds, repeated reversals, inconsistent external state. |
| Governance | High-risk actions without valid compensation are denied or require approval. | Agent executes irreversible actions under degraded conditions. |
| Forensics | Timeline records include trigger status, compensation dispatches, skips, and reasons. | Post-incident analysis turns into manual log archaeology. |
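The action-contract and governance rows can be sketched as a small pre-dispatch gate. This is an illustrative sketch only: the `Action`, `Risk`, `Decision`, and `Gate` names are hypothetical and are not taken from Cordum's code, and the policy shown (deny when the contract is missing, require approval for declared-irreversible high-risk actions) is one possible reading of the table.

```go
package main

import (
  "errors"
  "fmt"
)

// Risk is a hypothetical risk tier assigned to an action.
type Risk int

const (
  RiskLow Risk = iota
  RiskHigh
)

// Decision is the outcome of the pre-dispatch gate (hypothetical).
type Decision string

const (
  Allow           Decision = "ALLOW"
  RequireApproval Decision = "REQUIRE_APPROVAL"
  Deny            Decision = "DENY"
)

// Action is a hypothetical action contract: a side-effecting action must
// name a compensation topic or give an explicit no-compensation reason.
type Action struct {
  Topic                string
  SideEffecting        bool
  Risk                 Risk
  CompensationTopic    string
  NoCompensationReason string
}

// ErrNoContract is returned when neither half of the contract is declared.
var ErrNoContract = errors.New("action declares neither compensation nor a no-compensation reason")

// Gate applies the contract before dispatch.
func Gate(a Action) (Decision, error) {
  if !a.SideEffecting || a.CompensationTopic != "" {
    return Allow, nil
  }
  if a.NoCompensationReason == "" {
    return Deny, ErrNoContract
  }
  // Declared irreversible: high risk needs a human, low risk may proceed.
  if a.Risk == RiskHigh {
    return RequireApproval, nil
  }
  return Allow, nil
}

func main() {
  actions := []Action{
    {Topic: "payment.charge", SideEffecting: true, Risk: RiskHigh, CompensationTopic: "payment.refund"},
    {Topic: "email.send", SideEffecting: true, Risk: RiskLow, NoCompensationReason: "sent mail cannot be recalled"},
    {Topic: "data.purge", SideEffecting: true, Risk: RiskHigh, NoCompensationReason: "deletion is permanent"},
    {Topic: "access.grant", SideEffecting: true, Risk: RiskHigh},
  }
  for _, a := range actions {
    d, err := Gate(a)
    fmt.Println(a.Topic, d, err)
  }
}
```

Keeping the no-compensation reason as a required field makes "irreversible" an explicit, reviewable declaration rather than an omission.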

Cordum runtime evidence

These behaviors are verified from current scheduler/saga code paths.

| Control | Current behavior | Evidence | Why it matters |
| --- | --- | --- | --- |
| Rollback trigger and timeout | `FAILED_FATAL` with workflow ID starts saga rollback goroutine (30s timeout). | engine.go | Recovery is explicit and time-bounded. |
| LIFO compensation storage | Successful forward steps push compensation templates to `saga:<workflow_id>:stack`. | saga.go + scheduler-internals.md | Rollback replays newest side effects first. |
| Rollback lock | Per-workflow lock `saga:<workflow_id>:lock` with 2-minute TTL via `SETNX`. | saga.go | Prevents concurrent rollback races on the same workflow. |
| Compensation dispatch metadata | Compensation jobs are tagged (`saga_compensation=true`, `is_compensation=true`) and forced to CRITICAL priority. | saga.go | Improves routing and observability for recovery traffic. |
| Compensation idempotency | If no explicit key exists, scheduler derives one with `saga:<hash>` format. | saga.go | Enables safe replay of compensation paths. |
| Safety behavior during rollback | Safety deny skips compensation; safety unavailable proceeds with warning. | saga.go | Avoids hard blocking rollback, but can leave residual manual work. |

Implementation examples

Rollback loop with lock and LIFO stack (Go)

saga.go
Go
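func (s *SagaManager) Rollback(ctx context.Context, workflowID string) error {
  lockKey := fmt.Sprintf("saga:%s:lock", workflowID)
  ok, err := s.redis.SetNX(ctx, lockKey, "1", 2*time.Minute).Result()
  if err != nil { return fmt.Errorf("acquire rollback lock: %w", err) }
  if !ok { return nil } // rollback already active

  stackKey := fmt.Sprintf("saga:%s:stack", workflowID)
  for {
    // LPOP yields newest-first only if forward steps LPUSH onto the stack.
    data, err := s.redis.LPop(ctx, stackKey).Bytes()
    if errors.Is(err, redis.Nil) { break } // stack drained
    if err != nil { return fmt.Errorf("pop compensation: %w", err) }
    _ = data // unmarshal + dispatch compensation request (elided)
  }
  return nil
}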
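The LIFO ordering itself can be illustrated without Redis. The following is a minimal in-memory sketch, assuming hypothetical `Step`, `Result`, and `Run` names; it is not Cordum's saga manager, only the push-on-success, pop-in-reverse shape that the rollback loop relies on.

```go
package main

import (
  "errors"
  "fmt"
)

// Step pairs a forward action with its compensation (hypothetical type).
type Step struct {
  Name       string
  Do         func() error
  Compensate func() error
}

// Result summarizes one saga run.
type Result struct {
  Compensated []string // compensations that ran, newest side effect first
  Failed      []string // compensations that returned an error
  Err         error    // forward error that triggered the rollback, if any
}

// ErrUpstream is a sample fatal error used by the demo.
var ErrUpstream = errors.New("fatal: upstream rejected")

// Run executes steps in order. Each success pushes the step onto a stack;
// the first failure unwinds that stack from the top.
func Run(steps []Step) Result {
  var stack []Step
  for _, s := range steps {
    if err := s.Do(); err != nil {
      res := Result{Err: err}
      for i := len(stack) - 1; i >= 0; i-- {
        if cerr := stack[i].Compensate(); cerr != nil {
          // This sketch records the failure and keeps unwinding.
          res.Failed = append(res.Failed, stack[i].Name)
          continue
        }
        res.Compensated = append(res.Compensated, stack[i].Name)
      }
      return res
    }
    stack = append(stack, s)
  }
  return Result{}
}

func main() {
  ok := func() error { return nil }
  fail := func() error { return ErrUpstream }
  res := Run([]Step{
    {Name: "charge-payment", Do: ok, Compensate: ok},
    {Name: "reserve-inventory", Do: ok, Compensate: ok},
    {Name: "grant-access", Do: fail, Compensate: ok},
  })
  fmt.Println(res.Compensated, res.Err)
}
```

In this run the failing third step is never pushed, so only the two completed steps are compensated, with `reserve-inventory` before `charge-payment`.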

Compensation idempotency key derivation (Go)

compensation_idempotency.go
Go
func compensationIdempotencyKey(base *pb.JobRequest, comp *pb.Compensation) string {
  // Capability comes from the compensation, falling back to the forward job.
  capability := comp.Meta.GetCapability()
  if capability == "" {
    capability = base.Meta.GetCapability()
  }
  step := ""
  if base.StepIndex != 0 {
    step = fmt.Sprintf("%d", base.StepIndex)
  }
  seed := strings.Trim(strings.Join([]string{
    base.WorkflowId, base.JobId, comp.Topic, capability, step,
  }, "|"), "|")
  sum := sha256.Sum256([]byte(seed))
  return "saga:" + hex.EncodeToString(sum[:16]) // stable key → safe replay
}
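A stable key only helps if the receiving side deduplicates on it. The sketch below assumes a hypothetical `Ref` struct in place of the protobuf messages and a hypothetical in-memory `Dedupe` store; a real receiver would more likely use a durable store. It follows the same seed-and-hash shape as the derivation above and shows a replayed compensation collapsing to one effect.

```go
package main

import (
  "crypto/sha256"
  "encoding/hex"
  "fmt"
  "strings"
  "sync"
)

// Ref holds the fields that seed the key. It is a hypothetical stand-in
// for the protobuf request and compensation messages.
type Ref struct {
  WorkflowID, JobID, Topic, Capability string
  StepIndex                            int
}

// Key joins the fields with "|", trims outer separators, hashes the seed,
// and keeps the first 16 bytes of the SHA-256 digest as hex.
func Key(r Ref) string {
  step := ""
  if r.StepIndex != 0 {
    step = fmt.Sprintf("%d", r.StepIndex)
  }
  seed := strings.Trim(strings.Join([]string{
    r.WorkflowID, r.JobID, r.Topic, r.Capability, step,
  }, "|"), "|")
  sum := sha256.Sum256([]byte(seed))
  return "saga:" + hex.EncodeToString(sum[:16])
}

// Dedupe is an in-memory stand-in for the receiver's idempotency store.
type Dedupe struct {
  mu   sync.Mutex
  seen map[string]bool
}

func NewDedupe() *Dedupe { return &Dedupe{seen: make(map[string]bool)} }

// Apply runs effect only the first time a key is seen and reports
// whether it ran.
func (d *Dedupe) Apply(key string, effect func()) bool {
  d.mu.Lock()
  defer d.mu.Unlock()
  if d.seen[key] {
    return false
  }
  d.seen[key] = true
  effect()
  return true
}

func main() {
  ref := Ref{WorkflowID: "wf_19c1", JobID: "job_7", Topic: "payment.refund", Capability: "payments", StepIndex: 2}
  store := NewDedupe()
  refunds := 0
  for attempt := 0; attempt < 3; attempt++ { // simulate redelivery
    store.Apply(Key(ref), func() { refunds++ })
  }
  fmt.Println("refunds issued:", refunds)
}
```

The three deliveries share one key, so the refund effect runs once.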

Rollback evidence record (JSON)

rollback-evidence.json
JSON
{
  "workflow_id": "wf_19c1",
  "trigger_status": "FAILED_FATAL",
  "rollback_started": true,
  "lock_key": "saga:wf_19c1:lock",
  "compensation_dispatched": 3,
  "compensation_skipped_denied": 1,
  "compensation_failed_publish": 0
}
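One way to produce a record with these counters is to tally outcomes while dispatching. The sketch below is an assumption-laden illustration: `Verdict`, `Evidence`, `Tally`, and the injected `check` and `publish` functions are hypothetical, not Cordum APIs. It encodes the documented behavior that a safety deny skips the compensation, while an unavailable safety check proceeds with a warning.

```go
package main

import (
  "encoding/json"
  "errors"
  "fmt"
  "log"
)

// Verdict is a hypothetical safety outcome for one compensation job.
type Verdict int

const (
  VerdictAllow Verdict = iota
  VerdictDeny
)

// ErrSafetyUnavailable marks a safety check that could not complete.
var ErrSafetyUnavailable = errors.New("safety check unavailable")

// Evidence uses the same field names as the JSON record above.
type Evidence struct {
  WorkflowID      string `json:"workflow_id"`
  TriggerStatus   string `json:"trigger_status"`
  RollbackStarted bool   `json:"rollback_started"`
  LockKey         string `json:"lock_key"`
  Dispatched      int    `json:"compensation_dispatched"`
  SkippedDenied   int    `json:"compensation_skipped_denied"`
  FailedPublish   int    `json:"compensation_failed_publish"`
}

// Tally walks compensation topics (already newest-first) and counts
// outcomes. check and publish are injected so the sketch runs offline.
func Tally(workflowID string, topics []string,
  check func(topic string) (Verdict, error),
  publish func(topic string) error) Evidence {
  ev := Evidence{
    WorkflowID:      workflowID,
    TriggerStatus:   "FAILED_FATAL",
    RollbackStarted: true,
    LockKey:         "saga:" + workflowID + ":lock",
  }
  for _, t := range topics {
    v, err := check(t)
    if err == nil && v == VerdictDeny {
      ev.SkippedDenied++ // denied: skip and leave for operators
      continue
    }
    if err != nil {
      // Check unavailable: proceed, but leave a trace.
      log.Printf("warning: safety check failed for %s (%v), proceeding", t, err)
    }
    if perr := publish(t); perr != nil {
      ev.FailedPublish++
      continue
    }
    ev.Dispatched++
  }
  return ev
}

func main() {
  topics := []string{"access.revoke", "export.delete", "inventory.release", "payment.refund"}
  check := func(topic string) (Verdict, error) {
    if topic == "export.delete" {
      return VerdictDeny, nil
    }
    return VerdictAllow, nil
  }
  publish := func(string) error { return nil }
  out, _ := json.MarshalIndent(Tally("wf_19c1", topics, check, publish), "", "  ")
  fmt.Println(string(out))
}
```

Counting skips and publish failures separately keeps "denied by policy" distinguishable from "infrastructure failed" when the incident is reviewed.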

Limitations and tradeoffs

  • Some actions are not truly reversible. Compensation then becomes mitigation, not restore.
  • Safety-denied compensation steps can leave residual side effects that require manual operations.
  • Rollback is time-bounded (30s in scheduler flow); long compensations may need separate recovery playbooks.
  • Compensation logic must be tested as hard as forward execution or incident recovery will fail under load.
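The time-bound tradeoff can be made concrete with a small sketch. It assumes hypothetical `Compensation`, `RollbackBounded`, and `Demo` names and a caller-supplied budget; it is not the scheduler's implementation. The point is that a bounded rollback should return whatever did not finish, so those items can be handed to a separate recovery playbook.

```go
package main

import (
  "context"
  "fmt"
  "time"
)

// Compensation is a hypothetical unit of rollback work.
type Compensation struct {
  Name string
  Run  func(ctx context.Context) error
}

// RollbackBounded runs compensations in the given order under one shared
// deadline and returns the names that did not complete.
func RollbackBounded(parent context.Context, budget time.Duration, comps []Compensation) []string {
  ctx, cancel := context.WithTimeout(parent, budget)
  defer cancel()
  var pending []string
  for i, c := range comps {
    if ctx.Err() != nil {
      // Budget exhausted: everything not yet started is pending.
      for _, rest := range comps[i:] {
        pending = append(pending, rest.Name)
      }
      return pending
    }
    if err := c.Run(ctx); err != nil {
      pending = append(pending, c.Name)
    }
  }
  return pending
}

// Demo runs one fast, one slow, and one never-started compensation
// under a 50ms budget.
func Demo() []string {
  fast := func(context.Context) error { return nil }
  slow := func(ctx context.Context) error {
    select {
    case <-time.After(2 * time.Second): // simulated long-running reversal
      return nil
    case <-ctx.Done():
      return ctx.Err()
    }
  }
  return RollbackBounded(context.Background(), 50*time.Millisecond, []Compensation{
    {Name: "access.revoke", Run: fast},
    {Name: "ledger.reverse", Run: slow},
    {Name: "payment.refund", Run: fast},
  })
}

func main() {
  fmt.Println("needs manual playbook:", Demo())
}
```

Here the slow compensation consumes the whole budget, so both it and the compensation queued behind it are reported as pending rather than silently dropped.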

FAQ

How do you roll back an AI agent action?

You do not undo it — you run a second, compensating workflow. Each side-effecting step declares a compensation action up front; when a later step fails fatally, the orchestrator replays those compensations in reverse (LIFO) order. In Cordum, a successful step pushes its compensation template onto a per-workflow Redis stack, and a FAILED_FATAL result triggers the saga manager to pop and dispatch each compensation as a CRITICAL-priority job.

What is the saga pattern for AI agents?

The saga pattern models a long-running, multi-step workflow as a sequence of local transactions, each paired with a compensating transaction that semantically reverses it. For AI agents it means every action an agent takes against an external system (a refund, an access grant, a ticket) carries a defined way to undo or mitigate it, so partial failures resolve to a consistent state instead of leaving orphaned side effects.

When does Cordum trigger a saga rollback?

Cordum's scheduler starts a rollback when a job in a workflow returns JOB_STATUS_FAILED_FATAL and the workflow ID is known. It launches a bounded rollback goroutine (30-second context), acquires a per-workflow Redis lock (saga:<workflow_id>:lock, 2-minute TTL via SETNX) so two rollbacks cannot race, then pops compensation templates off saga:<workflow_id>:stack and dispatches them newest-first.

What happens if a compensation step is itself unsafe?

Compensation dispatch is gated by the Safety Kernel. If the kernel returns DENY for a compensation job, that step is skipped and logged rather than forced through — which can leave residual side effects for an operator to resolve. If the kernel is unavailable, the compensation proceeds with a warning so rollback is not hard-blocked by a control-plane outage.

How is compensation made idempotent?

If a compensation does not carry an explicit idempotency key, Cordum derives a stable one by hashing the workflow ID, originating job ID, compensation topic, capability, and step index (saga:<sha256-prefix>). Replays of the same compensation therefore collapse to one effect, preventing duplicate refunds or repeated reversals.

Next step

Run this rollout in one sprint:

  1. Classify side-effecting topics as compensatable, mitigatable, or irreversible.
  2. Block production dispatch when compensation metadata is missing.
  3. Run weekly `FAILED_FATAL` drills and verify rollback evidence completeness.
  4. Add a manual incident branch for compensation-denied and compensation-timeout outcomes.

Continue with Approval Workflows for Autonomous AI Agents and AI Agent Incident Report, or see how compensation fits the broader control plane in AI Agent Governance for Production.

Rollback is a product feature

If rollback exists only in diagrams, production will collect the debt with interest.