AI Agent Rollback & Compensation Patterns (2026)

If an agent can change systems, it needs a tested undo path before execution.

Guide | 11 min read | Apr 2026
TL;DR
  • Rollback for agent systems is a second workflow, not a database undo button.
  • Cordum triggers saga rollback on `FAILED_FATAL`, with LIFO compensation and a per-workflow lock.
  • Compensation quality depends on idempotency, safety behavior, and evidence you can query after the incident.
Compensation-first

Define how to undo before allowing side effects

Policy gate

Require approval or deny when rollback quality is weak

Bounded recovery

Use explicit lock and timeout windows for rollback execution

Scope

This guide targets autonomous workflows that touch external systems where partial success leaves real side effects.

The production problem

Partial failure is the default shape of distributed agent workflows. Payment may succeed while inventory rollback fails. Access may be granted while notification delivery times out.

Teams often discover this too late: the forward workflow is automated, but compensation is manual, slow, and undocumented.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS Saga patterns | Clear continuation vs compensation framing and choreography/orchestration split. | No concrete governance model for high-risk autonomous actions before dispatch. |
| AWS serverless saga with Step Functions | Solid end-to-end orchestration example with explicit compensation branches. | Limited guidance on compensation safety-deny semantics and operator evidence trails. |
| Azure Compensating Transaction pattern | Strong treatment of idempotency, non-reverse ordering, and manual intervention paths. | Not agent-control-plane specific; no pre-dispatch policy gate for autonomous workflows. |

Compensation model for agents

| Layer | Required design | Failure if missing |
| --- | --- | --- |
| Action contract | Every side-effecting action declares compensation or explicit no-compensation reason. | Partial success with no deterministic recovery path. |
| Idempotency | Compensation requests carry stable keys and can be retried safely. | Duplicate refunds, repeated reversals, inconsistent external state. |
| Governance | High-risk actions without valid compensation are denied or require approval. | Agent executes irreversible actions under degraded conditions. |
| Forensics | Timeline records include trigger status, compensation dispatches, skips, and reasons. | Post-incident analysis turns into manual log archaeology. |
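The action-contract and governance layers can be sketched as a pre-dispatch check. This is a minimal illustration, not Cordum's policy engine: `ActionContract`, `Risk`, `Decision`, and `Gate` are hypothetical names introduced here, and the rule (deny when nothing is declared, require approval for high-risk actions with no compensation) is one possible reading of the table.

```go
package main

import (
	"errors"
	"fmt"
)

// Risk is a hypothetical risk tier assigned to an action.
type Risk int

const (
	RiskLow Risk = iota
	RiskHigh
)

// ActionContract is a hypothetical declaration attached to every
// side-effecting action before it may be dispatched.
type ActionContract struct {
	Topic                string
	Risk                 Risk
	CompensationTopic    string // topic that undoes the action, if one exists
	NoCompensationReason string // required when CompensationTopic is empty
}

// Decision is the outcome of the pre-dispatch gate.
type Decision string

const (
	Allow           Decision = "allow"
	RequireApproval Decision = "require_approval"
	Deny            Decision = "deny"
)

// ErrNoContract marks an action that declares neither a compensation
// nor an explicit reason for having none.
var ErrNoContract = errors.New("action declares neither compensation nor a no-compensation reason")

// Gate evaluates a contract before dispatch.
func Gate(c ActionContract) (Decision, error) {
	hasComp := c.CompensationTopic != ""
	if !hasComp && c.NoCompensationReason == "" {
		return Deny, ErrNoContract // undeclared recovery path: never dispatch
	}
	if !hasComp && c.Risk == RiskHigh {
		return RequireApproval, nil // irreversible and high-risk: a human decides
	}
	return Allow, nil
}

func main() {
	d, err := Gate(ActionContract{
		Topic:                "email.send",
		Risk:                 RiskHigh,
		NoCompensationReason: "a sent email cannot be recalled",
	})
	fmt.Println(d, err)
}
```

Keeping the "no compensation" case explicit forces the author of an action to write down why it cannot be undone, which is the information an approver needs.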

Cordum runtime evidence

These behaviors are verified against the current scheduler and saga code paths.

| Control | Current behavior | Evidence | Why it matters |
| --- | --- | --- | --- |
| Rollback trigger and timeout | `FAILED_FATAL` with workflow ID starts saga rollback goroutine (30s timeout). | engine.go | Recovery is explicit and time-bounded. |
| LIFO compensation storage | Successful forward steps push compensation templates to `saga:<workflow_id>:stack`. | saga.go + scheduler-internals.md | Rollback replays newest side effects first. |
| Rollback lock | Per-workflow lock `saga:<workflow_id>:lock` with 2-minute TTL via `SETNX`. | saga.go | Prevents concurrent rollback races on the same workflow. |
| Compensation dispatch metadata | Compensation jobs are tagged (`saga_compensation=true`, `is_compensation=true`) and forced to CRITICAL priority. | saga.go | Improves routing and observability for recovery traffic. |
| Compensation idempotency | If no explicit key exists, scheduler derives one with `saga:<hash>` format. | saga.go | Enables safe replay of compensation paths. |
| Safety behavior during rollback | Safety deny skips compensation; safety unavailable proceeds with warning. | saga.go | Avoids hard blocking rollback, but can leave residual manual work. |
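The safety behavior in the last row can be read as a small decision function. The sketch below only restates the table (deny skips, unavailable proceeds with a warning); `SafetyResult`, `Outcome`, and `decideCompensation` are hypothetical names, and the skip-on-unknown default is an assumption rather than documented Cordum behavior.

```go
package main

import "fmt"

// SafetyResult is a hypothetical result of the safety check run
// before each compensation is dispatched.
type SafetyResult int

const (
	SafetyAllow SafetyResult = iota
	SafetyDeny
	SafetyUnavailable
)

// Outcome records what rollback does with one compensation step.
type Outcome struct {
	Dispatch   bool
	Warning    string // set when rollback proceeds without a safety verdict
	SkipReason string // set when the step is skipped
}

// decideCompensation maps a safety result to a rollback outcome.
func decideCompensation(r SafetyResult) Outcome {
	switch r {
	case SafetyAllow:
		return Outcome{Dispatch: true}
	case SafetyDeny:
		// Denied steps are skipped and left for manual follow-up.
		return Outcome{SkipReason: "safety_denied"}
	case SafetyUnavailable:
		// Rollback is not hard-blocked by an unreachable safety service.
		return Outcome{Dispatch: true, Warning: "safety_unavailable"}
	default:
		// Assumption: unknown results are treated conservatively.
		return Outcome{SkipReason: "safety_unknown"}
	}
}

func main() {
	fmt.Printf("%+v\n", decideCompensation(SafetyDeny))
	fmt.Printf("%+v\n", decideCompensation(SafetyUnavailable))
}
```

Recording the skip reason and the warning on the outcome is what lets an evidence record later distinguish "compensated" from "compensated without a safety verdict" and "left for an operator".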

Implementation examples

Rollback loop with lock and LIFO stack (Go)

saga.go
Go
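func (s *SagaManager) Rollback(ctx context.Context, workflowID string) error {
  lockKey := fmt.Sprintf("saga:%s:lock", workflowID)
  stackKey := fmt.Sprintf("saga:%s:stack", workflowID)

  // SETNX with a 2-minute TTL: only one rollback per workflow runs at a time.
  ok, err := s.redis.SetNX(ctx, lockKey, "1", 2*time.Minute).Result()
  if err != nil { return fmt.Errorf("acquire rollback lock: %w", err) }
  if !ok { return nil } // rollback already active

  for {
    // LPOP returns newest-first (LIFO) only if forward steps push with LPUSH (assumed here).
    data, err := s.redis.LPop(ctx, stackKey).Bytes()
    if err == redis.Nil { break } // stack drained
    if err != nil { return fmt.Errorf("pop compensation: %w", err) }
    _ = data // unmarshal + dispatch compensation request (elided)
  }
  return nil
}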
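To make the LIFO ordering concrete without Redis, here is a small in-memory model. It is an illustration only: `Step` and `rollbackOrder` are hypothetical, and the `undo:` prefix stands in for whatever compensation template a real forward step would push. Only steps that succeeded contribute a compensation, and the newest one is undone first.

```go
package main

import "fmt"

// Step is a hypothetical forward step and whether it succeeded.
type Step struct {
	Name      string
	Succeeded bool
}

// rollbackOrder pushes one compensation per successful step, stops at the
// first failure, then pops the stack to produce the compensation order.
func rollbackOrder(steps []Step) []string {
	var stack []string
	for _, st := range steps {
		if !st.Succeeded {
			break // the failed step produced no side effect to undo
		}
		stack = append(stack, "undo:"+st.Name)
	}
	order := make([]string, 0, len(stack))
	for i := len(stack) - 1; i >= 0; i-- {
		order = append(order, stack[i]) // newest side effect first
	}
	return order
}

func main() {
	order := rollbackOrder([]Step{
		{Name: "reserve-inventory", Succeeded: true},
		{Name: "charge-payment", Succeeded: true},
		{Name: "send-notification", Succeeded: false},
	})
	fmt.Println(order) // [undo:charge-payment undo:reserve-inventory]
}
```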

Compensation idempotency key derivation (Go)

compensation_idempotency.go
Go
func compensationIdempotencyKey(base *pb.JobRequest, comp *pb.Compensation) string {
  seed := strings.Join([]string{
    base.WorkflowId, base.JobId, comp.Topic, base.Meta.Capability,
  }, "|")
  sum := sha256.Sum256([]byte(seed))
  return "saga:" + hex.EncodeToString(sum[:16])
}
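The property that matters is that the key is a pure function of its inputs, so a retried compensation carries the same key and the receiver can deduplicate it. The standalone sketch below mirrors the derivation with plain strings instead of the protobuf types; `deriveKey` and the sample IDs are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// deriveKey hashes the identifying fields of a compensation and keeps the
// first 16 bytes of the SHA-256 digest (32 hex characters).
func deriveKey(workflowID, jobID, topic, capability string) string {
	seed := strings.Join([]string{workflowID, jobID, topic, capability}, "|")
	sum := sha256.Sum256([]byte(seed))
	return "saga:" + hex.EncodeToString(sum[:16])
}

func main() {
	first := deriveKey("wf_19c1", "job_7", "payments.refund", "payments")
	retry := deriveKey("wf_19c1", "job_7", "payments.refund", "payments")
	other := deriveKey("wf_19c1", "job_8", "payments.refund", "payments")

	fmt.Println(first == retry) // true: a retry reuses the key
	fmt.Println(first == other) // false: a different job gets its own key
}
```

One caveat of joining fields with `|`: if any field can itself contain `|`, two different inputs could produce the same seed, so a length-prefixed or structured encoding is a safer choice in that case.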

Rollback evidence record (JSON)

rollback-evidence.json
JSON
{
  "workflow_id": "wf_19c1",
  "trigger_status": "FAILED_FATAL",
  "rollback_started": true,
  "lock_key": "saga:wf_19c1:lock",
  "compensation_dispatched": 3,
  "compensation_skipped_denied": 1,
  "compensation_failed_publish": 0
}
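A record like this is most useful when something reads it automatically. As a sketch, the struct below decodes the fields shown above and flags workflows that need an operator; `RollbackEvidence`, `parseEvidence`, and `NeedsManualFollowUp` are hypothetical names, and the follow-up rule (rollback never started, or any skipped or unpublished compensation) is an assumption about what counts as incomplete recovery.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RollbackEvidence mirrors the fields of the sample record.
type RollbackEvidence struct {
	WorkflowID      string `json:"workflow_id"`
	TriggerStatus   string `json:"trigger_status"`
	RollbackStarted bool   `json:"rollback_started"`
	LockKey         string `json:"lock_key"`
	Dispatched      int    `json:"compensation_dispatched"`
	SkippedDenied   int    `json:"compensation_skipped_denied"`
	FailedPublish   int    `json:"compensation_failed_publish"`
}

// NeedsManualFollowUp reports whether automated rollback left work behind.
func (e RollbackEvidence) NeedsManualFollowUp() bool {
	return !e.RollbackStarted || e.SkippedDenied > 0 || e.FailedPublish > 0
}

// parseEvidence decodes one evidence record from JSON.
func parseEvidence(raw string) (RollbackEvidence, error) {
	var e RollbackEvidence
	err := json.Unmarshal([]byte(raw), &e)
	return e, err
}

// sampleEvidence is the record from the example above.
const sampleEvidence = `{
  "workflow_id": "wf_19c1",
  "trigger_status": "FAILED_FATAL",
  "rollback_started": true,
  "lock_key": "saga:wf_19c1:lock",
  "compensation_dispatched": 3,
  "compensation_skipped_denied": 1,
  "compensation_failed_publish": 0
}`

func main() {
	e, err := parseEvidence(sampleEvidence)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(e.WorkflowID, e.NeedsManualFollowUp()) // wf_19c1 true
}
```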

Limitations and tradeoffs

  • Some actions are not truly reversible. Compensation then becomes mitigation, not restore.
  • Safety-denied compensation steps can leave residual side effects that require manual operations.
  • Rollback is time-bounded (30s in scheduler flow); long compensations may need separate recovery playbooks.
  • Compensation logic must be tested as hard as forward execution or incident recovery will fail under load.
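The time bound in the third point can be sketched with `context.WithTimeout`: run compensations in order, and hand whatever did not finish to a manual playbook. Everything here is hypothetical (`Compensation`, `runBounded`, `demo`, the 50 ms window, and the choice to stop at the first step that fails or times out); it illustrates the tradeoff and does not describe Cordum's scheduler.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Compensation is a hypothetical named undo step.
type Compensation struct {
	Name string
	Run  func(ctx context.Context) error
}

// runBounded executes steps in order inside one shared time window.
// It stops at the first step that fails or outlives the deadline and
// returns that step plus all later ones as pending.
func runBounded(parent context.Context, window time.Duration, steps []Compensation) (completed, pending []string) {
	ctx, cancel := context.WithTimeout(parent, window)
	defer cancel()
	for i, st := range steps {
		if ctx.Err() != nil || st.Run(ctx) != nil {
			for _, rest := range steps[i:] {
				pending = append(pending, rest.Name)
			}
			return completed, pending
		}
		completed = append(completed, st.Name)
	}
	return completed, nil
}

// demo runs one fast step, one step that blocks until the deadline,
// and one step that is never reached.
func demo() (completed, pending []string) {
	fast := func(ctx context.Context) error { return nil }
	slow := func(ctx context.Context) error {
		<-ctx.Done() // simulates a compensation that outlives the window
		return ctx.Err()
	}
	return runBounded(context.Background(), 50*time.Millisecond, []Compensation{
		{Name: "refund-payment", Run: fast},
		{Name: "release-inventory", Run: slow},
		{Name: "revoke-access", Run: fast},
	})
}

func main() {
	completed, pending := demo()
	fmt.Println("completed:", completed)
	fmt.Println("pending for manual playbook:", pending)
}
```

The pending list is the input to the separate recovery playbook: it names exactly which side effects are still live when the window closes.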

Next step

Run this rollout in one sprint:

  1. Classify side-effecting topics as compensatable, mitigatable, or irreversible.
  2. Block production dispatch when compensation metadata is missing.
  3. Run weekly `FAILED_FATAL` drills and verify rollback evidence completeness.
  4. Add manual incident branch for compensation-denied and compensation-timeout outcomes.

Continue with Approval Workflows for Autonomous AI Agents and AI Agent Incident Report.

Rollback is a product feature

If rollback exists only in diagrams, production will collect the debt with interest.